1.
Overview and scope of application
Applicable objects: block storage, object storage, and file service nodes deployed in the Singapore region (for example AWS ap-southeast-1 or Alibaba Cloud Singapore).
Goal: ensure availability, predictable capacity, and alerting that is operationalized and automated. This article uses Prometheus, Grafana, and Alertmanager as the example monitoring stack and includes practical expansion and temporary mitigation steps.
2.
Monitoring item collection and deployment steps (instance level)
Steps: 1) Install node_exporter on each storage server (Debian/Ubuntu): sudo apt update && sudo apt install -y prometheus-node-exporter.
2) Configure the Prometheus scrape job: add the following under scrape_configs in prometheus.yml, then restart Prometheus with sudo systemctl restart prometheus.
    - job_name: 'nodes'
      static_configs:
        - targets: ['ip:9100']
3) Collection items: disk usage (/, /data), inode usage, disk latency (from iostat, or derived from the node_exporter counters node_disk_read_time_seconds_total / node_disk_reads_completed_total and their write counterparts), network bandwidth, CPU, memory, disk queue length, and number of open file handles.
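The disk latency item above is not exposed as a single gauge; it is usually derived from cumulative counters. The following is a minimal sketch of that arithmetic, assuming two samples of the per-device counters have already been fetched. The dictionary keys (reads, writes, read_time_s, write_time_s) are illustrative names, not node_exporter identifiers.

```python
def avg_io_latency_ms(prev, curr):
    """Average latency per completed I/O between two counter samples, in ms.

    prev and curr are dicts of cumulative counters for one device:
    reads, writes (completed operations) and read_time_s, write_time_s
    (total seconds spent on those operations). Key names are hypothetical.
    """
    ops = (curr["reads"] - prev["reads"]) + (curr["writes"] - prev["writes"])
    busy_s = (curr["read_time_s"] - prev["read_time_s"]) + (
        curr["write_time_s"] - prev["write_time_s"]
    )
    if ops <= 0 or busy_s < 0:
        # No completed I/O in the interval, or a counter reset after reboot.
        return 0.0
    return busy_s / ops * 1000.0


if __name__ == "__main__":
    before = {"reads": 1000, "writes": 500, "read_time_s": 10.0, "write_time_s": 5.0}
    after = {"reads": 1100, "writes": 600, "read_time_s": 12.0, "write_time_s": 11.0}
    print(f"{avg_io_latency_ms(before, after):.1f} ms per I/O")
```

In Prometheus itself the same ratio would normally be written as a rate() of the time counter divided by a rate() of the completed-operations counter.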
3.
Object storage and gateway monitoring
Steps: 1) For S3-compatible storage, enable access logging on the storage side, deliver the logs to a dedicated bucket, and parse them with Fluentd or Fluent Bit; then either expose the derived metrics for Prometheus to scrape or ship the logs directly to Elasticsearch.
2) Key indicators: PUT/GET 4xx and 5xx rates, p95/p99 response latency, sharding/replication lag, object count growth rate, and lifecycle hot/cold transition counts.
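As an illustration of how the error-rate and latency indicators can be derived once access logs are parsed, here is a small sketch. It assumes each log record has already been reduced to a (status, latency_ms) tuple; that shape and the function names are assumptions for the example, not a Fluentd or Fluent Bit format.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records):
    """Summarize parsed access-log records given as (http_status, latency_ms)."""
    records = list(records)
    total = len(records)
    if total == 0:
        return None  # nothing to report for an empty window
    latencies = [latency for _, latency in records]
    return {
        "rate_4xx": sum(1 for status, _ in records if 400 <= status < 500) / total,
        "rate_5xx": sum(1 for status, _ in records if 500 <= status < 600) / total,
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }


if __name__ == "__main__":
    sample = [(200, 12), (200, 15), (404, 9), (500, 230), (200, 18)]
    print(summarize(sample))
```

Nearest-rank is only one of several percentile definitions; a histogram-based estimate, as Prometheus uses, will give slightly different values on the same data.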
4.
Alarm rules and threshold recommendations (example)
Example Prometheus rules (the metric names here are illustrative): 1) disk_usage_percent > 80 for 5m → warning; > 90 for 2m → critical.
2) inode_usage > 90% for 5m. 3) disk_io_avg_latency_ms > 50 ms for 5m. 4) s3_5xx_rate > 0.5% for 10m.
Rule writing reference (less than 20% available is equivalent to more than 80% used):
    alert: DiskAlmostFull
    expr: (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) * 100 < 20
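To make the "threshold for a duration" semantics of the disk usage rule concrete, the sketch below classifies a series of usage samples. It is a simplified model, assuming one sample per 60-second scrape interval; it approximates, but does not reproduce, how Prometheus evaluates a for clause.

```python
def classify_disk_usage(samples, interval_s=60):
    """Classify disk usage from percentage samples, oldest first.

    Thresholds follow the example rules: above 90% held for 2 minutes is
    critical, above 80% held for 5 minutes is warning. One sample per
    interval_s seconds is assumed.
    """

    def held(threshold, hold_s):
        needed = hold_s // interval_s + 1  # samples that span the hold window
        recent = samples[-needed:]
        return len(recent) == needed and all(v > threshold for v in recent)

    if held(90, 120):
        return "critical"
    if held(80, 300):
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(classify_disk_usage([85, 85, 85, 91, 92, 93]))
```

A single sample above the threshold does not fire; the condition has to hold for the whole window, which is what filters out short spikes.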
5.
Alarm routing and receiver configuration
Steps: 1) Configure routes in Alertmanager: route to Slack, email, PagerDuty, or SMS by severity, team, and service labels.
2) Configure notification templates and suppression (inhibit rules and silences): for example, short-lived I/O peaks can be silenced for 15 minutes.
3) Test the process: trigger a simulated alarm with amtool or curl and confirm that both the primary and CC'd receivers get it.
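For the curl-based test, the request body has to be a JSON list of alerts. The sketch below builds such a payload for the Alertmanager v2 alerts endpoint; the alert name, instance label, and summary text are placeholder values chosen for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname, severity, instance, minutes=5, now=None):
    """Build a one-alert payload that expires after the given number of minutes."""
    now = now or datetime.now(timezone.utc)
    return [
        {
            "labels": {
                "alertname": alertname,
                "severity": severity,
                "instance": instance,  # hypothetical target, adjust to your host
            },
            "annotations": {"summary": "Simulated alert for routing test"},
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
        }
    ]


if __name__ == "__main__":
    payload = build_test_alert("DiskAlmostFull", "warning", "storage-01:9100")
    print(json.dumps(payload, indent=2))
```

Saved to a file, the output can be sent with curl as a POST with Content-Type application/json to /api/v2/alerts on the Alertmanager host (port 9093 by default). Varying the severity label is a quick way to check that each route reaches the intended receiver.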
6.
Alarm handling (runbook) and quick handling commands
General process: receive an alarm → log in to the affected host → check top, df -h, iostat, and vmstat → determine whether it is a sudden spike or long-term growth.
Quickly free up space: 1) clean /var/log (systemd journal): sudo journalctl --vacuum-time=3d; 2) clean temporary directories: sudo rm -rf /tmp/* (first confirm that no running service depends on files there); 3) delete old backups or migrate them to cold storage (example: aws s3 mv /backup s3://cold-bucket/ --recursive --storage-class GLACIER).
Temporary solution for capacity expansion: mount a new disk, rsync the data to it, and update /etc/fstab.
7.
Capacity planning steps (detailed how-to guide)
1) Data collection: export daily used_bytes, object_count, and daily_ingest_bytes for the past 90-180 days; you can export CSV from Prometheus or from a cloud monitoring API (AWS CloudWatch).
2) Calculate the daily growth rate: fit a linear regression, or take the average daily increment over the last 30 days = (last - first) / days.
3) Forecast and safety factor: size against the 95th percentile of the forecast at business peaks, then add 20%-30% strategic redundancy (up to 50% for critical services).
4) Develop a retention and tiering policy: hot storage for 30 days, cold storage for 90-365 days, with lifecycle rules enabled for automatic transition. Document the policy and register it in the CMDB.
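The growth-rate and forecast steps above can be sketched as follows. This is a minimal example assuming one used_bytes value per day with no gaps; the function names and the default 25% redundancy are illustrative choices within the 20%-30% range given above.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for ys against day index 0..n-1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_capacity(used_bytes, horizon_days, redundancy=0.25):
    """Projected usage horizon_days after the last sample, plus redundancy."""
    slope, intercept = linear_fit(used_bytes)
    projected = intercept + slope * (len(used_bytes) - 1 + horizon_days)
    return projected * (1 + redundancy)


def days_until_full(used_bytes, capacity_bytes):
    """Days until the fitted trend reaches capacity, or None if not growing."""
    slope, intercept = linear_fit(used_bytes)
    if slope <= 0:
        return None
    current = intercept + slope * (len(used_bytes) - 1)
    return max(0.0, (capacity_bytes - current) / slope)


if __name__ == "__main__":
    history = [100 + 10 * day for day in range(90)]  # synthetic 90-day series
    print(forecast_capacity(history, horizon_days=30))
    print(days_until_full(history, capacity_bytes=1990))
```

A straight-line fit ignores seasonality and step changes such as migrations, so the result is best treated as a baseline to adjust rather than a final number.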
8.
Capacity expansion operation (block storage/cloud disk and file system)
Cloud disk expansion (taking AWS as an example): 1) aws ec2 modify-volume --volume-id vol-xxx --size 200 --region ap-southeast-1.
2) Check on the instance with sudo lsblk. If the partition needs to grow: sudo growpart /dev/xvdf 1. Then grow the file system: for XFS, sudo xfs_growfs /mountpoint; for ext4, sudo resize2fs /dev/xvdf1.
Add a new disk and migrate: mount the new disk → rsync -av /data/ /mnt/newdata/ → update /etc/fstab → restart the service and switch over gradually.
9.
Q&A 1
Question: How can false 5xx alarms for object storage be avoided in the Singapore region?
Answer: The key is to combine percentage thresholds with short-term suppression. Use the 5xx request rate (5xx_count / total_requests) as the indicator and alert on a threshold such as > 0.5% sustained for 10 minutes. Suppress false alarms caused by short deployments (silence while deploy_tag=true), and correlate with request latency and the back-end error rate to decide whether it is a real fault.
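The decision logic in this answer can be sketched as below. It assumes per-minute windows with request counts and a deployment flag; the field names total, errors_5xx, and deploying are hypothetical, and the latency and back-end correlation step is left out for brevity.

```python
def should_alert_5xx(windows, threshold=0.005, hold_windows=10):
    """Decide whether to raise a 5xx alarm from per-minute windows, oldest first.

    Each window is a dict: {"total": int, "errors_5xx": int, "deploying": bool}.
    The alarm fires only if the 5xx rate exceeds the threshold in every one of
    the last hold_windows windows and none of them overlaps a deployment.
    """
    recent = windows[-hold_windows:]
    if len(recent) < hold_windows:
        return False  # not enough history to cover the hold period
    for window in recent:
        if window["deploying"]:
            return False  # suppressed, mirrors silencing while deploy_tag=true
        if window["total"] == 0:
            return False  # no traffic, so no meaningful error rate
        if window["errors_5xx"] / window["total"] <= threshold:
            return False
    return True


if __name__ == "__main__":
    bad_minute = {"total": 1000, "errors_5xx": 8, "deploying": False}
    print(should_alert_5xx([bad_minute] * 10))
```

Using a ratio rather than a raw count keeps the threshold meaningful at both low and high traffic levels.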
10.
Q&A 2
Question: What historical window gives more accurate capacity forecasts?
Answer: A window of 90 to 180 days is usually used so that both seasonality and recent trends are captured. For rapidly growing businesses, calculate the 30-day and 90-day growth rates in parallel, take the more conservative value, and keep 20%-30% redundancy. Adjust the forecast temporarily when promotions or migration windows are planned.
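The parallel-window comparison can be sketched as follows. It assumes daily used_bytes samples and interprets "conservative" as the larger of the two growth rates, which is the safer choice when sizing capacity; that interpretation and the function names are assumptions of this example.

```python
def avg_daily_increment(used_bytes, window_days):
    """(last - first) / days over the most recent window_days of daily samples."""
    window = used_bytes[-(window_days + 1):]
    if len(window) < 2:
        raise ValueError("need at least two daily samples")
    return (window[-1] - window[0]) / (len(window) - 1)


def conservative_growth(used_bytes, redundancy=0.2):
    """Larger of the 30-day and 90-day daily growth rates, plus redundancy."""
    rate_30 = avg_daily_increment(used_bytes, 30)
    rate_90 = avg_daily_increment(used_bytes, 90)
    return max(rate_30, rate_90) * (1 + redundancy)


if __name__ == "__main__":
    # Synthetic series: slow growth for 60 days, then faster for 30 days.
    series = [5 * day for day in range(61)] + [300 + 20 * k for k in range(1, 31)]
    print(avg_daily_increment(series, 30), avg_daily_increment(series, 90))
    print(conservative_growth(series))
```

When the recent rate is well above the long-window rate, as in the synthetic series here, the gap itself is a useful signal that growth is accelerating.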
11.
Q&A 3
Question: What should be the first step when a disk suddenly raises a high I/O alarm?
Answer: The first step is to check traffic and processes: log in to the host and run iostat -x 1 5, iotop, and ps aux --sort=-%cpu to determine whether a backup, scan, or batch job is the cause. If it is an expected task, throttle it or move it to another time or host. If it is an abnormal write, find the process generating the large files and temporarily stop that service. If necessary, move hot data off the affected disk to the cold disk.
